
    Load Value Approximation: Approaching the Ideal Memory Access Latency

    Approximate computing recognizes that many applications can tolerate inexactness. These applications, which range from multimedia processing to machine learning, operate on inherently noisy and imprecise data. As a result, we can trade off some loss in output value integrity for improved processor performance and energy efficiency. In this paper, we introduce load value approximation. In modern processors, upon a load miss in the private cache, the data must be retrieved from main memory or from the higher-level caches. These data accesses are costly both in terms of latency and energy. We implement load value approximators, which are hardware structures that learn value patterns and generate approximations of the data. The processor can then use these approximate data values to continue executing without incurring the high cost of accessing memory. We show that load value approximators can achieve high coverage while maintaining very low error in the application's output. By exploiting the approximate nature of applications, we can draw closer to the ideal memory access latency.
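    The core mechanism is a small hardware table that learns per-load value patterns. The sketch below is a minimal software illustration, not the paper's exact microarchitecture: it assumes a table indexed by load PC, a hypothetical last-value-plus-stride predictor, and a saturating confidence counter that gates when an approximation is used instead of the real memory access.

class LoadValueApproximator:
    def __init__(self, entries=256, confidence_threshold=2):
        # Table of predictor entries indexed by a hash of the load PC.
        self.entries = entries
        self.threshold = confidence_threshold
        self.table = [{"last": 0, "stride": 0, "conf": 0} for _ in range(entries)]

    def _index(self, pc):
        return pc % self.entries

    def predict(self, pc):
        """On a private-cache miss, return an approximate value if confident,
        otherwise None (meaning: perform the real memory access)."""
        entry = self.table[self._index(pc)]
        if entry["conf"] >= self.threshold:
            return entry["last"] + entry["stride"]
        return None

    def train(self, pc, actual_value):
        """Update the predictor when the real value eventually arrives."""
        entry = self.table[self._index(pc)]
        observed_stride = actual_value - entry["last"]
        if observed_stride == entry["stride"]:
            entry["conf"] = min(entry["conf"] + 1, 3)   # saturating counter
        else:
            entry["conf"] = max(entry["conf"] - 1, 0)
            entry["stride"] = observed_stride
        entry["last"] = actual_value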

    Power Modeling for Heterogeneous Processors

    As power becomes an ever more important design consideration, there is a need for accurate power models at all stages of the design process. While power models are available for CPUs and GPUs, only simple models are available for heterogeneous processors. We present a micro-benchmark-based modeling technique that can be used for chip multiprocessors (CMPs) and accelerated processing units (APUs). We use our approach to model power on an Intel Xeon CPU and an AMD Fusion heterogeneous processor. The resulting error rate for the Xeon's model is below 3% and is only 7% for the Fusion. We also present a method to reduce the number of benchmarks required to create these models. Instead of running micro-benchmarks for every combination of factors (e.g., different operations or memory access patterns), we cluster similar micro-benchmarks to avoid unnecessary simulations. We show that it is possible to eliminate as many as 93% of the compute micro-benchmarks while still producing power models with less than 10% error rate.
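    The clustering step can be illustrated with a short sketch. This is my own illustration under stated assumptions, not the paper's methodology: the feature vectors, the synthetic power numbers, and the use of k-means plus a linear regression are all hypothetical stand-ins for grouping similar micro-benchmarks and fitting a model from only one representative per cluster.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

# Hypothetical per-benchmark feature vectors (e.g., instruction mix,
# memory access intensity) and measured power, in arbitrary units.
rng = np.random.default_rng(0)
features = rng.random((40, 4))               # 40 micro-benchmarks, 4 counters
power = features @ np.array([5.0, 2.0, 8.0, 1.0]) + 10.0

# Cluster similar micro-benchmarks and keep one representative per cluster,
# so only the representatives need to be run on real hardware.
kmeans = KMeans(n_clusters=6, n_init=10, random_state=0).fit(features)
representatives = [np.where(kmeans.labels_ == c)[0][0] for c in range(6)]

# Fit the power model on the representatives only, then check it on all.
model = LinearRegression().fit(features[representatives], power[representatives])
pred = model.predict(features)
print("mean absolute error over all benchmarks:",
      float(np.mean(np.abs(pred - power))))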

    BlackJack: Secure machine learning on IoT devices through hardware-based shuffling

    Neural networks are seeing increased use in diverse Internet of Things (IoT) applications such as healthcare, smart homes and industrial monitoring. Their widespread use makes neural networks a lucrative target for theft. An attacker can obtain a model without having access to the training data or incurring the cost of training. Also, networks trained using private data (e.g., medical records) can reveal information about this data. Networks can be stolen by leveraging side channels such as power traces of the IoT device when it is running the network. Existing attacks require operations to occur in the same order each time; an attacker must collect and analyze several traces of the device to steal the network. Therefore, to prevent this type of attack, we randomly shuffle the order of operations each time. With shuffling, each operation can now happen at many different points in each execution, making the attack intractable. However, we show that shuffling in software can leak information that can be used to subvert this solution. Therefore, to perform secure shuffling and reduce latency, we present BlackJack, hardware added as a functional unit within the CPU. BlackJack secures neural networks on IoT devices by increasing the time needed for an attack to centuries, while adding just 2.46% area, 3.28% power and 0.56% latency overhead on an ARM M0+ SoC.
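    A software analogy of the idea is sketched below. It is only an illustration of shuffling the visit order of per-neuron computations, not BlackJack's hardware shuffler: the layer shape, the function name and the use of a cryptographic RNG are assumptions standing in for the in-CPU functional unit.

import secrets

def shuffled_dense_layer(inputs, weights, biases):
    """Compute one fully connected layer with the per-neuron dot products
    visited in a freshly randomized order; the numerical result is unchanged."""
    n_out = len(biases)
    order = list(range(n_out))
    # Fisher-Yates shuffle driven by a cryptographic RNG, standing in for the
    # hardware shuffler, so the order differs on every inference.
    for i in range(n_out - 1, 0, -1):
        j = secrets.randbelow(i + 1)
        order[i], order[j] = order[j], order[i]

    outputs = [0.0] * n_out
    for neuron in order:                       # randomized visit order
        acc = biases[neuron]
        for x, w in zip(inputs, weights[neuron]):
            acc += x * w
        outputs[neuron] = acc
    return outputs

print(shuffled_dense_layer([1.0, 2.0], [[0.5, -1.0], [2.0, 0.25]], [0.1, -0.2]))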

    CD-Xbar: a converge-diverge crossbar network for high-performance GPUs

    Modern GPUs feature an increasing number of streaming multiprocessors (SMs) to boost system throughput. How to construct an efficient and scalable network-on-chip (NoC) for future high-performance GPUs is particularly critical. Although a mesh network is a widely used NoC topology in manycore CPUs for scalability and simplicity reasons, it is ill-suited to GPUs because of the many-to-few-to-many traffic pattern observed in GPU-compute workloads. While a crossbar NoC is a natural fit, it does not scale to large SM counts while operating at high frequency. In this paper, we propose the converge-diverge crossbar (CD-Xbar) network with round-robin routing and topology-aware concurrent thread array (CTA) scheduling. CD-Xbar consists of two types of crossbars, a local crossbar and a global crossbar. A local crossbar converges input ports from the SMs into so-called converged ports; the global crossbar diverges these converged ports to the last-level cache (LLC) slices and memory controllers. CD-Xbar provides routing path diversity through the converged ports. Round-robin routing and topology-aware CTA scheduling balance network traffic among the converged ports within a local crossbar and across crossbars, respectively. Compared to a mesh with the same bisection bandwidth, CD-Xbar reduces NoC active silicon area and power consumption by 52.5 and 48.5 percent, respectively, while at the same time improving performance by 13.9 percent on average. CD-Xbar performs within 2.9 percent of an idealized fully-connected crossbar. We further demonstrate CD-Xbar's scalability, flexibility and improved performance per Watt (by 17.1 percent) over state-of-the-art GPU NoCs, which are highly customized and non-scalable.
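    The converge/diverge structure and the round-robin spreading over converged ports can be pictured with a toy model. This is a simplification of my own, not the paper's RTL or routing algorithm: the group sizes, port counts and the routing function are hypothetical.

class LocalCrossbar:
    def __init__(self, num_converged_ports):
        self.ports = num_converged_ports
        self.rr = 0                            # round-robin pointer

    def pick_converged_port(self):
        port = self.rr
        self.rr = (self.rr + 1) % self.ports
        return port

def route(sm_id, llc_slice, local_xbars, sms_per_xbar):
    """Return (local crossbar, converged port, destination LLC slice) for a
    request from an SM; the converged port is chosen round-robin so traffic
    is balanced across the local crossbar's outputs."""
    xbar_id = sm_id // sms_per_xbar
    port = local_xbars[xbar_id].pick_converged_port()
    return xbar_id, port, llc_slice

# Example: 32 SMs grouped onto 4 local crossbars with 2 converged ports each.
xbars = [LocalCrossbar(num_converged_ports=2) for _ in range(4)]
for sm in range(8):
    print(route(sm, llc_slice=sm % 16, local_xbars=xbars, sms_per_xbar=8))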

    Interconnects for DNA, quantum, in-memory and optical computing: insights from a panel discussion

    The computing world is witnessing a proverbial Cambrian explosion of emerging paradigms propelled by applications such as Artificial Intelligence, Big Data, and Cybersecurity. The recent advances in technology to store digital data inside a DNA strand, manipulate quantum bits (qubits), perform logical operations with photons, and perform computations inside memory systems are ushering in the era of the emerging paradigms of DNA computing, quantum computing, optical computing, and in-memory computing. In an orthogonal direction, research on interconnect design using advanced electro-optic, wireless, and microfluidic technologies has shown promising solutions to the architectural limitations of traditional von Neumann computers. In this article, experts present their comments on the role of interconnects in the emerging computing paradigms and discuss the potential use of chiplet-based architectures for the heterogeneous integration of such technologies. This work was supported in part by US NSF CAREER Grant CNS-1553264 and the EU H2020 research and innovation programme under Grant 863337.

    SigNet: Network-on-Chip Filtering for Coarse Vector Directories

    Scalable cache coherence is imperative as systems move into the many-core era, with core counts numbering in the hundreds. Directory protocols are often favored as more scalable in terms of bandwidth requirements than broadcast protocols; however, directories incur storage overheads that can become prohibitive in large systems. In this paper, we explore the impact that reducing directory overheads has on the network-on-chip and propose SigNet to mitigate these issues. SigNet utilizes signatures within the network fabric to filter out extraneous requests prior to reaching their destinations. Overall, we demonstrate average reductions in interconnect activity of 21% and latency improvements of 20% over a coarse vector directory while utilizing as little as 25% of the area of a full-map directory.
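    Signature-based filtering can be sketched as a Bloom-filter-style membership test held at a network node. This is a hypothetical illustration; the paper's exact signature encoding, sizing and placement may differ. A request whose address definitely was not inserted into the destination's signature can be dropped in the network instead of being delivered.

class Signature:
    def __init__(self, bits=256, hashes=3):
        self.bits = bits
        self.hashes = hashes
        self.field = 0                          # bit vector stored as an int

    def _positions(self, addr):
        return [hash((addr, i)) % self.bits for i in range(self.hashes)]

    def insert(self, addr):
        for p in self._positions(addr):
            self.field |= (1 << p)

    def may_contain(self, addr):
        # False means the address was definitely never inserted, so the
        # request is extraneous and can be filtered in the network.
        return all(self.field & (1 << p) for p in self._positions(addr))

sig = Signature()
sig.insert(0x1A40)
print(sig.may_contain(0x1A40))   # True: forward the request
print(sig.may_contain(0x9F00))   # most likely False: filter it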

    Data Criticality in Network-On-Chip Design

    Many network-on-chip (NoC) designs focus on maximizing performance, delivering data to each core no later than needed by the application. Yet to achieve greater energy efficiency, we argue that it is just as important that data is delivered no earlier than needed. To address this, we explore data criticality in CMPs. Caches fetch data in bulk (blocks of multiple words). Depending on the application's memory access patterns, some words are needed right away (critical), while other data are fetched too soon (non-critical). On a wide range of applications, we perform a limit study of the impact of data criticality in NoC design. Criticality-oblivious designs can waste up to 37.5% energy compared to an idealized NoC that fetches each word both no later and no earlier than needed. Furthermore, 62.3% of energy is wasted fetching data that is not used by the application. We present NoCNoC, a practical, criticality-aware NoC design that achieves up to 60.5% energy savings with no loss in performance. Our work moves towards an ideally-efficient NoC, delivering data both no later and no earlier than needed.
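    The classification itself can be shown with a back-of-the-envelope sketch of my own (the slack threshold and cycle numbers are made up, not taken from the paper): words of a fetched block that are read almost immediately are critical, words read long after the fetch are non-critical, and words never read are pure waste.

def classify_block(fetch_cycle, first_use_cycles, slack_threshold=100):
    """first_use_cycles holds, per word, the cycle of first use or None if the
    word is never read before the block is evicted."""
    labels = []
    for use in first_use_cycles:
        if use is None:
            labels.append("unused")            # fetched but never read
        elif use - fetch_cycle <= slack_threshold:
            labels.append("critical")          # needed right away
        else:
            labels.append("non-critical")      # could have arrived later
    return labels

# Example: an 8-word block fetched at cycle 1000.
print(classify_block(1000, [1005, None, 1020, 2500, None, 1010, 3000, None]))
# -> ['critical', 'unused', 'critical', 'non-critical', 'unused',
#     'critical', 'non-critical', 'unused']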